Abstract

We address online linear optimization problems when the possible actions of the decision
maker are represented by binary vectors. The regret of the decision maker is the difference
between her realized loss and the minimal loss she would have achieved by picking, in hindsight,
the best possible action. Our goal is to understand the magnitude of the best possible
(minimax) regret. We study the problem under three different assumptions for the feedback
the decision maker receives: full information, and the partial information models of the
so-called "semi-bandit" and "bandit" problems. In the full information case we show that the
standard exponentially weighted average forecaster is a provably suboptimal strategy. For the
semi-bandit model, by combining the Mirror Descent algorithm and the INF (Implicitly
Normalized Forecaster) strategy, we are able to prove the first optimal bounds. Finally, in the
bandit case we discuss existing results in light of a new lower bound, and suggest a conjecture
on the optimal regret in that case.

1 Introduction.

In this paper we consider the framework of online linear optimization. The setup may be described
as a repeated game between a "decision maker" (or simply "player" or "forecaster") and an
"adversary" as follows: at each time instance $t = 1, \ldots, n$, the player chooses, possibly in a
randomized way, an action from a given finite action set $\mathcal{A} \subset \mathbb{R}^d$. The action
chosen by the player at time $t$ is denoted by $a_t \in \mathcal{A}$. Simultaneously to the player, the
adversary chooses a loss vector $z_t \in \mathcal{Z} \subset \mathbb{R}^d$ and the loss incurred by the
forecaster is $a_t^T z_t$. The goal of the player is to minimize the expected cumulative loss
$\mathbb{E} \sum_{t=1}^n a_t^T z_t$, where the expectation is taken with respect to the player's internal
randomization (and possibly the adversary's randomization).

In the basic "full-information" version of this problem, the player observes the adversary's
move $z_t$ at the end of round $t$. Another important model for feedback is the so-called bandit
problem, in which the player only observes the incurred loss $a_t^T z_t$. As a measure of performance
we define the regret^1 of the player as

$$R_n = \mathbb{E} \sum_{t=1}^n a_t^T z_t - \min_{a \in \mathcal{A}} \mathbb{E} \sum_{t=1}^n a^T z_t.$$

In this paper we address a specific example of online linear optimization: we assume that the action
set $\mathcal{A}$ is a subset of the $d$-dimensional hypercube $\{0,1\}^d$ such that
$\forall a \in \mathcal{A}$, $\|a\|_1 = m$, and the adversary has a bounded loss per coordinate, that
is,^2 $\mathcal{Z} = [0,1]^d$. We call this setting online combinatorial optimization. As we will see
below, this restriction of the general framework contains a rich class of problems. Indeed, in many
interesting cases, actions are naturally represented by Boolean vectors.

In addition to the full information and bandit versions of online combinatorial optimization,
we also consider another type of feedback which makes sense only in this combinatorial setting.
In the semi-bandit version, we assume that the player observes only the coordinates of $z_t$ that
were played in $a_t$, that is, the player observes the vector $(a_t(1) z_t(1), \ldots, a_t(d) z_t(d))$.
All three variants of online combinatorial optimization are sketched in Figure 1. More rigorously,
online combinatorial optimization is defined as a repeated game between a "player" and an
"adversary." At each round $t = 1, \ldots, n$ of the game, the player chooses a probability
distribution $p_t$ over the set of actions $\mathcal{A} \subset \{0,1\}^d$ and draws a random action
$a_t \in \mathcal{A}$ according to $p_t$. Simultaneously, the adversary chooses a vector
$z_t \in [0,1]^d$. More formally, $z_t$ is a measurable function of the "past"
$(p_s, a_s, z_s)_{s=1,\ldots,t-1}$. In the full information case, $p_t$ is a measurable function of
$(p_s, a_s, z_s)_{s=1,\ldots,t-1}$. In the semi-bandit case, $p_t$ is a measurable function of
$(p_s, a_s, (a_s(i) z_s(i))_{i=1,\ldots,d})_{s=1,\ldots,t-1}$, and in the bandit problem it is a
measurable function of $(p_s, a_s, a_s^T z_s)_{s=1,\ldots,t-1}$.
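
The three feedback models differ only in what the player gets to see after a round. The following is a minimal Python sketch of one round of the game, under the assumption that actions and loss vectors are plain tuples; the names `feedback` and `play_round` are illustrative and do not come from the paper.

```python
import random

def feedback(a, z, mode):
    """Return what the player observes after playing binary action a against loss z."""
    if mode == "full":
        return list(z)                                  # the whole loss vector z_t
    if mode == "semi-bandit":
        return [ai * zi for ai, zi in zip(a, z)]        # the coordinates a_t(i) z_t(i)
    if mode == "bandit":
        return sum(ai * zi for ai, zi in zip(a, z))     # only the scalar loss a_t^T z_t
    raise ValueError(f"unknown feedback mode: {mode}")

def play_round(actions, p, z, mode, rng=random):
    """One round: draw a_t ~ p_t, incur a_t^T z_t, return (action, loss, observation)."""
    a = rng.choices(actions, weights=p, k=1)[0]
    loss = sum(ai * zi for ai, zi in zip(a, z))
    return a, loss, feedback(a, z, mode)
```

In all three modes the incurred loss is the same; only the returned observation changes, which is what restricts the information available for choosing $p_{t+1}$.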

1.1 Motivating examples. 

Many problems can be tackled under the online combinatorial optimization framework. We give 
here three simple examples: 

• m-sets. In this example we consider the set $\mathcal{A}$ of all $\binom{d}{m}$ Boolean vectors in
dimension $d$ with exactly $m$ ones. In other words, at every time step, the player selects $m$ actions out of

^1 In the full information version, it is straightforward to obtain upper bounds for the stronger notion of regret
$\mathbb{E} \sum_{t=1}^n a_t^T z_t - \mathbb{E} \min_{a \in \mathcal{A}} \sum_{t=1}^n a^T z_t$, which is always at least
as large as $R_n$. However, for partial information games, this requires more work. In this paper we only
consider $R_n$ as a measure of the regret.

^2 Note that since all actions have the same size, i.e. $\|a\|_1 = m$, $\forall a \in \mathcal{A}$, one can reduce the
case of $\mathcal{Z} = [\alpha, \beta]^d$ to $\mathcal{Z} = [0,1]^d$ via a simple renormalization.

Parameters: set of actions $\mathcal{A} \subset \{0,1\}^d$; number of rounds $n \in \mathbb{N}$.
For each round $t = 1, 2, \ldots, n$:

(1) the player chooses a probability distribution $p_t$ over $\mathcal{A}$ and draws a random action
$a_t \in \mathcal{A}$ according to $p_t$;

(2) simultaneously, the adversary selects a loss vector $z_t \in [0,1]^d$ (without revealing it);

(3) the player incurs the loss $a_t^T z_t$. She observes

- the loss vector $z_t$ in the full information setting,

- the coordinates $z_t(i) a_t(i)$ in the semi-bandit setting,

- the instantaneous loss $a_t^T z_t$ in the bandit setting.

Goal: The player tries to minimize her cumulative loss $\sum_{t=1}^n a_t^T z_t$.

Figure 1: Online combinatorial optimization.

$d$ possibilities. When $m = 1$, the semi-bandit and bandit versions coincide and correspond
to the standard (adversarial) multi-armed bandit problem.

• Online shortest path problem. Consider a communication network represented by a graph 
in which one has to send a sequence of packets from one fixed vertex to another. For each 
packet one chooses a path through the graph and suffers a certain delay which is the sum of 
the delays on the edges of the path. Depending on the traffic, the delays on the edges may 
change, and, at the end of each round, according to the assumed level of feedback, the player 
observes either the delays of all edges, the delays of each edge on the chosen path, or only 
the total delay of the chosen path. The player's objective is to minimize the total delay for 
the sequence of packets. 

One can represent the set of valid paths from the starting vertex to the end vertex as a set
$\mathcal{A} \subset \{0,1\}^d$ where $d$ is the number of edges. If at time $t$, $z_t \in [0,1]^d$ is the
vector of delays on the edges, then the delay of a path $a \in \mathcal{A}$ is $z_t^T a$. Thus this
problem is an instance of online combinatorial optimization in dimension $d$, where $d$ is the number
of edges in the graph. In this paper we assume, for simplicity, that all valid paths have the same length $m$.

• Ranking. Consider the problem of selecting a ranking of m items out of M possible items. 
For example a website could have a set of M ads, and it has to select a ranked list of m of 
these ads to appear on the webpage. One can rephrase this problem as selecting a matching of 
size $m$ on the complete bipartite graph $K_{m,M}$ (with $d = m \times M$ edges). In the online learning
version of this problem, each day the website chooses one such list, and gains one dollar for 
each click on the ads. This problem can easily be formulated as an online combinatorial 
optimization problem. 

Our theory applies to many more examples, such as spanning trees (which can be useful in certain 
communication problems), or m-intervals. 
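
To make the m-sets example concrete, here is a small Python sketch that enumerates the action set for given $d$ and $m$. The function name `m_sets` is an illustrative choice; explicit enumeration is only practical for small $d$, since the set has $\binom{d}{m}$ elements.

```python
from itertools import combinations
from math import comb

def m_sets(d, m):
    """All binary vectors of dimension d with exactly m ones."""
    actions = []
    for support in combinations(range(d), m):
        a = [0] * d
        for i in support:
            a[i] = 1
        actions.append(tuple(a))
    return actions

actions = m_sets(4, 2)
# Each action has l1-norm m, and there are comb(d, m) of them.
assert len(actions) == comb(4, 2)
assert all(sum(a) == 2 for a in actions)
```

With `m = 1` the function returns the canonical basis, which is the action set of the standard multi-armed bandit problem.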






1.2 Previous work. 



• Full Information. The full-information setting is now fairly well understood, and an optimal
regret bound (in terms of $m$, $d$, $n$) was obtained by Koolen, Warmuth, and Kivinen
[26]. Previous papers under full information feedback also include Gentile and Warmuth
[14], Kivinen and Warmuth [ ], Grove, Littlestone, and Schuurmans [15], Takimoto and
Warmuth [ ], Kalai and Vempala [22], Warmuth and Kuzmin [ ], Herbster and Warmuth
[19], and Hazan, Kale, and Warmuth [18].

• Semi-bandit. The first paper on the adversarial multi-armed bandit problem (i.e., the special
case of m-sets with $m = 1$) is by Auer, Cesa-Bianchi, Freund, and Schapire [ ] who derived
a regret bound of order $\sqrt{dn \log d}$. This result was improved to $\sqrt{dn}$ by Audibert and Bubeck
[2, 3]. György, Linder, Lugosi, and Ottucsák [16] consider the online shortest path problem
and derive suboptimal regret bounds (in terms of the dependency on $m$ and $d$). Uchiya,
Nakamura, and Kudo [35] (respectively Kale, Reyzin, and Schapire [23]) derived optimal
regret bounds for the case of m-sets (respectively for the problem of ranking selection) up to
logarithmic factors.

• Bandit. McMahan and Blum [27], and Awerbuch and Kleinberg [5] were the first to consider
this setting, and obtained suboptimal regret bounds (in terms of $n$). The first paper with
optimal dependency in $n$ was by Dani, Hayes, and Kakade [12]. The dependency on $m$ and
$d$ was then improved in various ways by Abernethy, Hazan, and Rakhlin [1], Cesa-Bianchi
and Lugosi [11], and Bubeck, Cesa-Bianchi, and Kakade [9]. We discuss these bounds in
detail in Section 4. In particular, we argue that the optimal regret bound in terms of $d$ (and
$m$) is still an open problem.

We also refer the interested reader to the recent survey [8] for an overview of bandit problems in 
various other settings. 

1.3 Contribution and contents of the paper. 

In this paper we are primarily interested in the optimal minimax regret in terms of $m$, $d$ and $n$. More
precisely, our aim is to determine the order of magnitude of the following quantity: For a given
feedback assumption, write sup for the supremum over all adversaries and inf for the infimum
over all allowed strategies for the player under the feedback assumption. (Recall the definition of
"adversary" and "player" from the introduction.) Then we are interested in

$$\max_{\mathcal{A} \subset \{0,1\}^d \,:\, \forall a \in \mathcal{A},\, \|a\|_1 = m} \ \inf \ \sup \ R_n.$$

Our contribution to the study of this quantity is threefold. First, we unify the algorithms used
in Abernethy, Hazan, and Rakhlin [1], Koolen, Warmuth, and Kivinen [26], Uchiya, Nakamura,
and Kudo [35], and Kale, Reyzin, and Schapire [23] under the umbrella of mirror descent. The
idea of mirror descent goes back to Nemirovski [28], Nemirovski and Yudin [29]. A somewhat
similar concept was re-discovered in online learning by Herbster and Warmuth [20], Grove,
Littlestone, and Schuurmans [15], Kivinen and Warmuth [ ] under the name of potential-based gradient
descent, see [10, Chapter 11]. Recently, these ideas have been flourishing, see for instance
Shalev-Shwartz [33], Rakhlin [30], Hazan [17], and Bubeck [ ]. Our main theorem (Theorem 2) allows
one to recover almost all known regret bounds for online combinatorial optimization. This first
contribution leads to our second main result, the improvement of the known upper bounds for the
semi-bandit game. In particular, we propose a different proof of the minimax regret bound of the
order of $\sqrt{nd}$ in the standard $d$-armed bandit game that is much simpler than the one provided
in Audibert and Bubeck [ ] (which also improves the constant factor). In addition to these upper
bounds we prove two new lower bounds. First we answer a question of Koolen, Warmuth, and
Kivinen [26] by showing that the exponentially weighted average forecaster is provably suboptimal
for online combinatorial optimization. Our second lower bound is a minimax lower bound in
the bandit setting which improves known results by an order of magnitude. A summary of known
bounds and the new bounds proved in this paper can be found in Table 1.

Table 1: Bounds on the minimax regret (up to constant factors). The new results are set in boldface. In this
paper we also show that EXP2 in the full information case has a regret bounded below by $d^{3/2}\sqrt{n}$ (when $m$
is of order $d$).

The paper is organized as follows. In Section 2 we introduce the two algorithms discussed 
in this paper. In particular in Section 2.1 we discuss the popular exponentially weighted average 
forecaster and we show that it is a provably suboptimal strategy. Then in Section 2.2 we describe 
our main algorithm, OSMD (Online Stochastic Mirror Descent), and prove a general regret bound in 
terms of the Bregman divergence of the Fenchel-Legendre dual of the Legendre function defining 
the strategy. In Section 3 we derive upper bounds for the regret in the semi-bandit case for OSMD 
with appropriately chosen Legendre functions. Finally in Section 4 we prove a new lower bound 
for the bandit setting, and we formulate a conjecture on the correct order of magnitude of the regret 
for that problem based on this new result and the regret bounds obtained in [1, 9]. 

2 Algorithms. 

In this section we discuss two classes of algorithms that have been proposed for online combina- 
torial optimization. 

2.1 Expanded Exponential weights (EXP2).

The simplest approach to online combinatorial optimization is to consider each action of $\mathcal{A}$ as
an independent "expert," and then apply a generic regret minimizing strategy. Perhaps the most
popular such strategy is the exponentially weighted average forecaster (see, e.g., [10]). (This
strategy is sometimes called Hedge, see Freund and Schapire [13].) We call the resulting strategy
for the online combinatorial optimization problem EXP2, see Figure 2. In the full information
setting, EXP2 corresponds to "Expanded Hedge," as defined in Koolen, Warmuth, and Kivinen
[26]. In the semi-bandit case, EXP2 was studied by György, Linder, Lugosi, and Ottucsák [16],
while in the bandit case it was studied in Dani, Hayes, and Kakade [12], Cesa-Bianchi and Lugosi [11], and
Bubeck, Cesa-Bianchi, and Kakade [9]. Note that in the bandit case, EXP2 is mixed with an
exploration distribution, see Section 4 for more details.

Despite strong interest in this strategy, no optimal regret bound has been derived for it in the
combinatorial setting. More precisely, the best bound (which can be derived from a standard
argument, see for example [12] or [26]) is of order $m^{3/2} \sqrt{n \log\left(\frac{d}{m}\right)}$. On the other
hand, in [26] the authors showed that by using Mirror Descent (see next section) with the negative
entropy, one obtains a regret bounded by $m \sqrt{n \log\left(\frac{d}{m}\right)}$. Furthermore this latter bound
is clearly optimal (up to a numerical constant), as one can see from the standard lower bound in
prediction with expert advice (consider the set $\mathcal{A}$ that corresponds to playing $m$ expert
problems in parallel with $d/m$ experts in each problem). In [26] the authors leave as an open question
the problem of whether it would be possible to improve the bound for EXP2 to obtain the optimal order
of magnitude. The following theorem shows that this is impossible, and that in fact EXP2 is a provably
suboptimal strategy.

Theorem 1 Let $n \ge d$. There exists a subset $\mathcal{A} \subset \{0,1\}^d$ such that in the full
information setting, the regret of the EXP2 strategy (for any learning rate $\eta$) satisfies

$$\sup_{\text{adversary}} R_n \ge 0.01\, d^{3/2} \sqrt{n}.$$

The proof is deferred to the Appendix.
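
As an illustration of the strategy discussed in this section, the following is a minimal Python sketch of EXP2 in the full information game, where the loss estimate is the true loss vector. It treats every action as an expert and stores one weight per action, so it is only usable when the action set is small enough to enumerate. The uniform initial distribution and the function names are assumptions made for this sketch.

```python
import math

def exp2_update(actions, p, z_tilde, eta):
    """One EXP2 step: reweight each action a by exp(-eta * a^T z_tilde), then renormalize."""
    weights = [
        pa * math.exp(-eta * sum(ai * zi for ai, zi in zip(a, z_tilde)))
        for a, pa in zip(actions, p)
    ]
    total = sum(weights)
    return [w / total for w in weights]

def run_exp2_full_information(actions, losses, eta):
    """Run EXP2 with z_tilde = z_t; return the player's expected cumulative loss and final p."""
    p = [1.0 / len(actions)] * len(actions)   # uniform start (an assumption of this sketch)
    expected_loss = 0.0
    for z in losses:
        # Expected loss of the round under the current distribution p_t.
        expected_loss += sum(
            pa * sum(ai * zi for ai, zi in zip(a, z)) for a, pa in zip(actions, p)
        )
        p = exp2_update(actions, p, z, eta)
    return expected_loss, p
```

For example, with `actions = [(1, 0), (0, 1)]` and a loss vector `(1.0, 0.0)` repeated over many rounds, the distribution concentrates on the second action.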

2.2 Online Stochastic Mirror Descent.

In this section we describe the main algorithm studied in this paper. We call it Online Stochastic
Mirror Descent (OSMD). Each term in this name refers to a part of the algorithm: Mirror Descent
originates in the work of Nemirovski and Yudin [29]. The idea of mirror descent is to perform a
gradient descent, where the update with the gradient is performed in the dual space (defined by 
some Legendre function F) rather than in the primal (see below for a precise formulation). The 
Stochastic part takes its origin from Robbins and Monro [31] and from Kiefer and Wolfowitz [24]. 
The key idea is that it is enough to observe an unbiased estimate of the gradient rather than the true 
gradient in order to perform a gradient descent. Finally the Online part comes from Zinkevich [37]. 
Zinkevich derived the Online Gradient Descent (OGD) algorithm, which is a version of gradient 
descent tailored to online optimization. 

To properly describe the OSMD strategy, we recall a few concepts from convex analysis, see
Hiriart-Urruty and Lemaréchal [21] for a thorough treatment of this subject. Let
$\mathcal{D} \subset \mathbb{R}^d$ be an open convex set, and $\overline{\mathcal{D}}$ the closure of $\mathcal{D}$.

Definition 1 We call Legendre any continuous function $F : \overline{\mathcal{D}} \to \mathbb{R}$ such that
(i) $F$ is strictly convex and continuously differentiable on $\mathcal{D}$,
(ii) $\lim_{x \to \overline{\mathcal{D}} \setminus \mathcal{D}} \|\nabla F(x)\| = +\infty$.^3
The Bregman divergence $D_F : \overline{\mathcal{D}} \times \mathcal{D} \to \mathbb{R}$ associated to a Legendre
function $F$ is defined by

$$D_F(x, y) = F(x) - F(y) - (x - y)^T \nabla F(y).$$

Moreover, we say that $\mathcal{D}^* = \nabla F(\mathcal{D})$ is the dual space of $\mathcal{D}$ under $F$. We
also denote by $F^*$ the Legendre-Fenchel transform of $F$ defined by

$$F^*(u) = \sup_{x \in \overline{\mathcal{D}}} \left( x^T u - F(x) \right).$$

EXP2:

Parameter: Learning rate $\eta$.

For each round $t = 1, 2, \ldots, n$:

(a) Play $a_t \sim p_t$ and observe

- the loss vector $z_t$ in the full information game,

- the coordinates $z_t(i) \mathbb{1}_{a_t(i) = 1}$ in the semi-bandit game,

- the instantaneous loss $a_t^T z_t$ in the bandit game.

(b) Estimate the loss vector $z_t$ by $\tilde z_t$. For instance, one may take

- $\tilde z_t = z_t$ in the full information game,

- $\tilde z_t(i) = \frac{z_t(i)}{\sum_{a \in \mathcal{A} : a(i) = 1} p_t(a)} a_t(i)$ in the semi-bandit game,

- $\tilde z_t = P_t^+ a_t a_t^T z_t$, with $P_t = \mathbb{E}_{a \sim p_t}(a a^T)$, in the bandit game.

(c) Update the probabilities, for all $a \in \mathcal{A}$,

$$p_{t+1}(a) = \frac{\exp(-\eta a^T \tilde z_t)\, p_t(a)}{\sum_{b \in \mathcal{A}} \exp(-\eta b^T \tilde z_t)\, p_t(b)}.$$

Figure 2: The EXP2 strategy. The notation $\mathbb{E}_{a \sim p_t}$ denotes expected value with respect to the
random choice of $a$ when it is distributed according to $p_t$.

Lemma 1 Let $F$ be a Legendre function. Then $F^{**} = F$ and $\nabla F^* = (\nabla F)^{-1}$ on the set
$\mathcal{D}^*$. Moreover, $\forall x, y \in \mathcal{D}$,

$$D_F(x, y) = D_{F^*}(\nabla F(y), \nabla F(x)). \quad (1)$$

^3 By the equivalence of norms in $\mathbb{R}^d$, this definition does not depend on the choice of the norm.

The lemma above is the key to understanding how a Legendre function acts on the space. The
gradient $\nabla F$ maps $\mathcal{D}$ to the dual space $\mathcal{D}^*$, and $\nabla F^*$ is the inverse mapping
from the dual space to the original (primal) space. Moreover, (1) shows that the Bregman divergence in
the primal space corresponds exactly to the Bregman divergence of the Legendre-Fenchel transform in the
dual space. A proof of this result can be found, for example, in [Chapter 11, [10]].
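
Identity (1) can be checked numerically on a concrete Legendre function. The Python sketch below uses the negative entropy $F(x) = \sum_i x_i \log x_i - \sum_i x_i$ on $(0, +\infty)^d$, for which $\nabla F(x) = (\log x_i)_i$ and $F^*(u) = \sum_i \exp(u_i)$. The helper names and the test points are illustrative choices.

```python
import math

def F(x):            # negative entropy: sum_i x_i log x_i - sum_i x_i
    return sum(xi * math.log(xi) - xi for xi in x)

def grad_F(x):       # gradient of F: (log x_i)_i
    return [math.log(xi) for xi in x]

def F_star(u):       # Legendre-Fenchel transform of F: sum_i exp(u_i)
    return sum(math.exp(ui) for ui in u)

def grad_F_star(u):  # gradient of F*: (exp u_i)_i, the inverse map of grad_F
    return [math.exp(ui) for ui in u]

def bregman(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - (x - y)^T grad f(y)."""
    return f(x) - f(y) - sum((xi - yi) * gi for xi, yi, gi in zip(x, y, grad_f(y)))

x = [0.2, 0.5, 1.3]
y = [0.7, 0.1, 0.9]
primal = bregman(F, grad_F, x, y)                          # D_F(x, y)
dual = bregman(F_star, grad_F_star, grad_F(y), grad_F(x))  # D_{F*}(grad F(y), grad F(x))
assert abs(primal - dual) < 1e-12
```

Note the swap of arguments in the dual divergence, which mirrors the order in (1).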

We now have all ingredients to describe the OSMD strategy, see Figure 3 for the precise
formulation. Note that step (d) is well defined if the following consistency condition is satisfied:

$$\nabla F(x) - \eta \tilde z_t \in \mathcal{D}^*, \quad \forall x \in \mathrm{Conv}(\mathcal{A}) \cap \mathcal{D}. \quad (2)$$

In the full information setting, algorithms of this type were studied by Abernethy, Hazan, and
Rakhlin [1], Rakhlin [30], and Hazan [17]. In these papers the authors adopted the presentation
suggested by Beck and Teboulle [6], which corresponds to a Follow-the-Regularized-Leader
(FTRL) type strategy. There the focus was on $F$ being strongly convex with respect to some norm.
Moreover, in [1] the authors also consider the bandit case, and switch to $F$ being a self-concordant
barrier for the convex hull of $\mathcal{A}$ (see Section 4 for more details). Another line of work studied this
type of algorithm with $F$ being the negative entropy, see Koolen, Warmuth, and Kivinen [26] for
the full information case and Uchiya, Nakamura, and Kudo [35], Kale, Reyzin, and Schapire [23]
for specific instances of the semi-bandit case. All these results are unified and described in detail
in Bubeck [ ]. In this paper we consider a new type of Legendre function $F$ inspired by Audibert
and Bubeck [ ], see Section 3.

Regarding computational complexity, OSMD is efficient as soon as the polytope $\mathrm{Conv}(\mathcal{A})$ can
be described by a polynomial (in $d$) number of constraints. Indeed in that case steps (a)-(b) can be
performed efficiently jointly (one can get an algorithm by looking at the proof of Carathéodory's
theorem), and step (d) is a convex program with a polynomial number of constraints. In many
interesting examples (such as m-sets, selection of rankings, spanning trees, paths in acyclic graphs)
one can describe the convex hull of $\mathcal{A}$ by a polynomial number of constraints, see Schrijver [ ].
On the other hand, there also exist important examples where this is not the case (such as paths on
general graphs). Also note that for some specific examples it is possible to implement OSMD with
improved computational complexity, see Koolen, Warmuth, and Kivinen [26].

In this paper we restrict our attention to the combinatorial learning setting in which $\mathcal{A}$ is a
subset of $\{0,1\}^d$ and the loss is linear. However, one should note that this specific form of $\mathcal{A}$ plays
no role in the definition of OSMD. Moreover, if the loss is not linear, then one can modify OSMD
by performing a gradient update with a gradient of the loss (rather than the loss vector $z_t$). See
Bubeck [7] for more details on this approach.
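
To make the dual update and the projection of OSMD concrete, here is a minimal Python sketch for one special case only: $F$ is the negative entropy and $\mathcal{A}$ is the canonical basis, so that $\mathrm{Conv}(\mathcal{A})$ is the probability simplex. In that case the gradient step in the dual space is a multiplicative update, and the Bregman projection onto the simplex reduces to a renormalization. General action sets require a genuine convex program for the projection, which this sketch does not cover. The function name is illustrative.

```python
import math

def osmd_step_negative_entropy_simplex(x, z_tilde, eta):
    """Dual update and projection of OSMD for F = negative entropy on the simplex."""
    # Dual step: grad F(w) = grad F(x) - eta * z_tilde, i.e. log w_i = log x_i - eta * z_tilde_i.
    w = [xi * math.exp(-eta * zi) for xi, zi in zip(x, z_tilde)]
    # Projection: minimizing D_F(., w) over the simplex rescales w to sum to one.
    total = sum(w)
    return [wi / total for wi in w]

x = [1.0 / 3] * 3                       # minimizer of the negative entropy over the simplex
x_next = osmd_step_negative_entropy_simplex(x, [1.0, 0.0, 0.0], eta=0.5)
```

After the step, the coordinate that received a positive loss estimate has less mass than before, and the other two coordinates share the remainder equally.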

The following result is at the basis of our improved regret bounds for OSMD in the semi-bandit 
setting, see Section 3. 

Theorem 2 Suppose that (2) is satisfied and the loss estimates are unbiased in the sense that
$\mathbb{E}_{a_t \sim p_t} \tilde z_t = z_t$. Then the regret of the OSMD strategy satisfies

$$R_n \le \frac{\sup_{a \in \mathcal{A}} F(a) - F(x_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n \mathbb{E}\, D_{F^*}\left( \nabla F(x_t) - \eta \tilde z_t, \nabla F(x_t) \right).$$

OSMD:

Parameters:

• learning rate $\eta > 0$,

• Legendre function $F$ defined on $\overline{\mathcal{D}} \supset \mathrm{Conv}(\mathcal{A})$.

Let $x_1 \in \mathrm{argmin}_{x \in \mathrm{Conv}(\mathcal{A})} F(x)$.

For each round $t = 1, 2, \ldots, n$:

(a) Let $p_t$ be a distribution on the set $\mathcal{A}$ such that $x_t = \mathbb{E}_{a \sim p_t} a$.

(b) Draw a random action $a_t$ according to the distribution $p_t$ and observe the feedback.

(c) Based on the observed feedback, estimate the loss vector $z_t$ by $\tilde z_t$.

(d) Let $w_{t+1} \in \mathcal{D}$ satisfy

$$\nabla F(w_{t+1}) = \nabla F(x_t) - \eta \tilde z_t. \quad (3)$$

(e) Project the weight vector $w_{t+1}$ defined by (3) on the convex hull of $\mathcal{A}$:

$$x_{t+1} \in \mathrm{argmin}_{x \in \mathrm{Conv}(\mathcal{A})} D_F(x, w_{t+1}). \quad (4)$$

Figure 3: Online Stochastic Mirror Descent (OSMD).



Proof Let $a \in \mathcal{A}$. Using that $a_t$ and $\tilde z_t$ are unbiased estimates of $x_t$ and $z_t$, we have

$$\mathbb{E} \sum_{t=1}^n (a_t - a)^T z_t = \mathbb{E} \sum_{t=1}^n (x_t - a)^T \tilde z_t.$$

Using (3), and applying the definition of the Bregman divergences, one obtains

$$\eta \tilde z_t^T (x_t - a) = (a - x_t)^T \left( \nabla F(w_{t+1}) - \nabla F(x_t) \right) = D_F(a, x_t) + D_F(x_t, w_{t+1}) - D_F(a, w_{t+1}).$$

By the Pythagorean theorem for Bregman divergences (see, e.g., Lemma 11.3 of [10]), we have
$D_F(a, w_{t+1}) \ge D_F(a, x_{t+1}) + D_F(x_{t+1}, w_{t+1})$, hence

$$\eta \tilde z_t^T (x_t - a) \le D_F(a, x_t) + D_F(x_t, w_{t+1}) - D_F(a, x_{t+1}) - D_F(x_{t+1}, w_{t+1}).$$

Summing over $t$ gives

$$\sum_{t=1}^n \eta \tilde z_t^T (x_t - a) \le D_F(a, x_1) - D_F(a, x_{n+1}) + \sum_{t=1}^n \left( D_F(x_t, w_{t+1}) - D_F(x_{t+1}, w_{t+1}) \right).$$

By the nonnegativity of the Bregman divergences, we get

$$\sum_{t=1}^n \eta \tilde z_t^T (x_t - a) \le D_F(a, x_1) + \sum_{t=1}^n D_F(x_t, w_{t+1}).$$

From (1), one has $D_F(x_t, w_{t+1}) = D_{F^*}\left( \nabla F(x_t) - \eta \tilde z_t, \nabla F(x_t) \right)$. Moreover, by
writing the first-order optimality condition for $x_1$, one directly obtains
$D_F(a, x_1) \le F(a) - F(x_1)$, which concludes the proof. ■

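Note that, if $F$ admits a Hessian, denoted $\nabla^2 F$, that is always invertible, then one can prove
that, up to a third-order term (in $\eta$), the regret bound can be written as

$$R_n \lessapprox \frac{\sup_{a \in \mathcal{A}} F(a) - F(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \mathbb{E}\, \tilde z_t^T \left( \nabla^2 F(x_t) \right)^{-1} \tilde z_t. \quad (5)$$

The main technical difficulty is to control the third-order error term in this inequality.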



3 Semi-bandit feedback. 

In this section we consider online combinatorial optimization with semi-bandit feedback. As we
already discussed, in the full information case Koolen, Warmuth, and Kivinen [26] proved that
OSMD with the negative entropy is a minimax optimal strategy. We first prove a regret bound when
one uses this strategy with the following estimate for the loss vector:

$$\tilde z_t(i) = \frac{z_t(i) a_t(i)}{x_t(i)}. \quad (6)$$

Note that this is a valid estimate since it makes use only of $(z_t(1) a_t(1), \ldots, z_t(d) a_t(d))$. Moreover,
it is unbiased with respect to the random draw of $a_t$ from $p_t$, since by definition,
$\mathbb{E}_{a_t \sim p_t} a_t(i) = x_t(i)$. In other words, $\mathbb{E}_{a_t \sim p_t} \tilde z_t(i) = z_t(i)$.
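
The unbiasedness of this estimate can be verified exactly on a small instance. The Python sketch below uses rational arithmetic so that the equality holds without rounding; the particular action set (2-sets in dimension 3), distribution, and loss vector are arbitrary illustrative choices.

```python
from fractions import Fraction

actions = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]             # 2-sets in dimension 3
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]    # a distribution p_t over the actions
z = [Fraction(1, 4), Fraction(3, 4), Fraction(1, 2)]    # a loss vector z_t in [0, 1]^3
d = len(z)

# x_t = E_{a ~ p_t} a, the mean action under p_t.
x = [sum(pa * a[i] for a, pa in zip(actions, p)) for i in range(d)]

def estimate(a):
    """Semi-bandit estimate: z_tilde(i) = z(i) a(i) / x(i), using only played coordinates."""
    return [z[i] * a[i] / x[i] for i in range(d)]

# Exact expectation of the estimate over the draw a ~ p_t.
mean = [sum(pa * estimate(a)[i] for a, pa in zip(actions, p)) for i in range(d)]
assert mean == z
```

The check relies on every coordinate of the mean action being positive, which holds here because each coordinate is played by some action with positive probability.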

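Theorem 3 The regret of OSMD with $F(x) = \sum_{i=1}^d x_i \log x_i - \sum_{i=1}^d x_i$ (and
$\mathcal{D} = (0, +\infty)^d$) and any non-negative unbiased loss estimate $\tilde z_t(i) \ge 0$ satisfies

$$R_n \le \frac{m \log \frac{d}{m}}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \sum_{i=1}^d \mathbb{E}\, x_t(i) \tilde z_t(i)^2.$$

In particular, with the estimate (6) and $\eta = \sqrt{2 \frac{m \log \frac{d}{m}}{nd}}$,

$$R_n \le \sqrt{2 m d n \log \frac{d}{m}}.$$

Proof One can easily see that for the negative entropy the dual space is $\mathcal{D}^* = \mathbb{R}^d$. Thus, (2) is
verified and OSMD is well defined. Moreover, again by straightforward computations, one can also
see that

$$D_{F^*}\left( \nabla F(x), \nabla F(y) \right) = \sum_{i=1}^d y_i\, \Theta\left( (\nabla F(x) - \nabla F(y))_i \right), \quad (7)$$

where $\Theta(x) = \exp(x) - 1 - x$. Thus, using Theorem 2 and the facts that $\Theta(x) \le \frac{x^2}{2}$ for
$x \le 0$ and $\sum_{i=1}^d x_t(i) = m$, one obtains

$$R_n \le \frac{\sup_{a \in \mathcal{A}} F(a) - F(x_1)}{\eta} + \frac{1}{\eta} \sum_{t=1}^n \mathbb{E}\, D_{F^*}\left( \nabla F(x_t) - \eta \tilde z_t, \nabla F(x_t) \right)$$

$$\le \frac{\sup_{a \in \mathcal{A}} F(a) - F(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \sum_{i=1}^d \mathbb{E}\, x_t(i) \tilde z_t(i)^2.$$

The proof of the first inequality is concluded by noting that:

$$F(a) - F(x_1) \le \sum_{i=1}^d x_1(i) \log \frac{1}{x_1(i)} \le m \log \left( \sum_{i=1}^d \frac{x_1(i)}{m} \frac{1}{x_1(i)} \right) = m \log \frac{d}{m}.$$

The second inequality follows from

$$\mathbb{E}\, x_t(i) \tilde z_t(i)^2 \le \mathbb{E}\, \frac{a_t(i)}{x_t(i)} = 1.$$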

Using the standard $\sqrt{dn}$ lower bound for the multi-armed bandit (which corresponds to the case
where $\mathcal{A}$ is the canonical basis), see e.g., [Theorem 30, [ ]], one can directly obtain a lower bound
of order $\sqrt{mdn}$ for our setting. Thus the upper bound derived in Theorem 3 has an extraneous
logarithmic factor compared to the lower bound. This phenomenon already appeared in the basic
multi-armed bandit setting. In that case, the extra logarithmic factor was removed in Audibert and
Bubeck [ ] by resorting to a new class of strategies for the expert problem, called INF (Implicitly
Normalized Forecaster). Next we generalize this class of algorithms to the combinatorial setting,
and thus remove the extra logarithmic factor. First we introduce the notion of a potential and the
associated Legendre function.

Definition 2 Let $\omega \ge 0$. A function $\psi : (-\infty, a) \to \mathbb{R}_+^*$ for some
$a \in \mathbb{R} \cup \{+\infty\}$ is called an $\omega$-potential if it is convex, continuously differentiable,
and satisfies

$$\lim_{x \to -\infty} \psi(x) = \omega, \qquad \lim_{x \to a} \psi(x) = +\infty,$$

$$\psi' > 0, \qquad \int_\omega^{\omega+1} |\psi^{-1}(s)|\, ds < +\infty.$$

For every potential $\psi$ we associate the function $F_\psi$ defined on $\mathcal{D} = (\omega, +\infty)^d$ by:

$$F_\psi(x) = \sum_{i=1}^d \int_\omega^{x_i} \psi^{-1}(s)\, ds.$$

In this paper we restrict our attention to 0-potentials which we will simply call potentials. A
non-zero value of $\omega$ may be used to derive regret bounds that hold with high probability (instead
of pseudo-regret bounds, see footnote 1).

The first order optimality condition for (4) implies that OSMD with $F_\psi$ is a direct generalization
of INF with potential $\psi$, in the sense that the two algorithms coincide when $\mathcal{A}$ is the canonical basis.
Note, in particular, that with $\psi(x) = \exp(x)$ we recover the negative entropy for $F_\psi$. In [3], the
choice of $\psi(x) = (-x)^{-q}$ with $q > 1$ was recommended. We show in Theorem 4 that here, again,
this choice gives a minimax optimal strategy.
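
The claim that the exponential potential recovers the negative entropy can be checked directly from the definition of $F_\psi$ with $\omega = 0$. The short derivation below is a worked verification, using only the convention $0 \log 0 = 0$:

```latex
\psi(x) = \exp(x)
\;\Longrightarrow\;
\psi^{-1}(s) = \log s ,
\qquad
F_\psi(x) = \sum_{i=1}^d \int_0^{x_i} \log s \, ds
          = \sum_{i=1}^d \Big[ s \log s - s \Big]_0^{x_i}
          = \sum_{i=1}^d x_i \log x_i - \sum_{i=1}^d x_i .
```

This is exactly the function $F$ used with OSMD in Theorem 3.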

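Lemma 2 Let $\psi$ be a potential. Then $F = F_\psi$ is Legendre and for all
$u, v \in \mathcal{D}^* = (-\infty, a)^d$ such that $u_i \le v_i$, $\forall i \in \{1, \ldots, d\}$,

$$D_{F^*}(u, v) \le \frac{1}{2} \sum_{i=1}^d \psi'(v_i) (u_i - v_i)^2.$$

Proof A direct examination shows that $F = F_\psi$ is a Legendre function. Moreover, since
$\nabla F^*(u) = (\nabla F)^{-1}(u) = (\psi(u_1), \ldots, \psi(u_d))$, we obtain

$$D_{F^*}(u, v) = \sum_{i=1}^d \left( \int_{v_i}^{u_i} \psi(s)\, ds - (u_i - v_i) \psi(v_i) \right).$$

From a Taylor expansion, we get

$$D_{F^*}(u, v) \le \sum_{i=1}^d \max_{s \in [u_i, v_i]} \frac{1}{2} \psi'(s) (u_i - v_i)^2.$$

Since the function $\psi$ is convex, and $u_i \le v_i$, we have

$$\max_{s \in [u_i, v_i]} \psi'(s) \le \psi'\left( \max(u_i, v_i) \right) \le \psi'(v_i),$$

which gives the desired result.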



Theorem 4 Let $\psi$ be a potential. The regret of OSMD with $F = F_\psi$ and any non-negative unbiased
loss estimate $\tilde z_t$ satisfies

$$R_n \le \frac{\sup_{a \in \mathcal{A}} F(a) - F(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \sum_{i=1}^d \mathbb{E}\, \frac{\tilde z_t(i)^2}{(\psi^{-1})'(x_t(i))}.$$

In particular, with the estimate (6), $\psi(x) = (-x)^{-q}$, $q > 1$, and
$\eta = \sqrt{\frac{2}{q-1} \frac{m^{1-2/q}}{d^{1-2/q}\, n}}$,

$$R_n \le q \sqrt{\frac{2}{q-1}\, m d n}.$$

With $q = 2$ this gives

$$R_n \le 2 \sqrt{2 m d n}.$$

In the case $m = 1$, the above theorem improves the bound $R_n \le 8\sqrt{nd}$ obtained in Theorem
11 of [ ].
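
The tuning of the learning rate in Theorem 4 can be sanity-checked numerically. The Python sketch below assumes that, for $\psi(x) = (-x)^{-q}$ and the estimate (6), the two terms of the general bound are controlled by $\frac{q}{q-1} m^{1-1/q} d^{1/q} / \eta$ and $\frac{\eta}{2} n q\, m^{1/q} d^{1-1/q}$ (the Hölder bounds used in the proof), and verifies that the stated learning rate balances them to give $q \sqrt{\frac{2}{q-1} m d n}$. The function names are illustrative.

```python
import math

def inf_learning_rate(m, d, n, q=2.0):
    """Learning rate eta = sqrt( 2/(q-1) * m^(1-2/q) / (d^(1-2/q) * n) )."""
    return math.sqrt(2.0 / (q - 1.0) * m ** (1.0 - 2.0 / q) / (d ** (1.0 - 2.0 / q) * n))

def theorem4_bound(m, d, n, q=2.0):
    """Closed-form bound q * sqrt( 2/(q-1) * m * d * n )."""
    return q * math.sqrt(2.0 / (q - 1.0) * m * d * n)

def tradeoff(m, d, n, q, eta):
    """Sum of the two terms of the bound for a given eta (assumed forms, see lead-in)."""
    first = q / (q - 1.0) * m ** (1.0 - 1.0 / q) * d ** (1.0 / q) / eta
    second = eta / 2.0 * n * q * m ** (1.0 / q) * d ** (1.0 - 1.0 / q)
    return first + second

m, d, n, q = 3, 20, 5000, 2.5
eta = inf_learning_rate(m, d, n, q)
assert math.isclose(tradeoff(m, d, n, q, eta), theorem4_bound(m, d, n, q), rel_tol=1e-9)
```

Any other learning rate makes the sum of the two terms strictly larger, which is why the stated value is the natural choice.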

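Proof First note that since $\mathcal{D}^* = (-\infty, a)^d$ and $\tilde z_t$ has non-negative coordinates, OSMD is well
defined (that is, (2) is satisfied).

The first inequality follows from Theorem 2 and the fact that $\psi'(\psi^{-1}(s)) = \frac{1}{(\psi^{-1})'(s)}$.

Let $\psi(x) = (-x)^{-q}$. Then $\psi^{-1}(x) = -x^{-1/q}$ and $F(x) = -\frac{q}{q-1} \sum_{i=1}^d x_i^{1-1/q}$. In particular,
note that by Hölder's inequality, since $\sum_{i=1}^d x_1(i) = m$:

$$F(a) - F(x_1) \le \frac{q}{q-1} \sum_{i=1}^d x_1(i)^{(q-1)/q} \le \frac{q}{q-1}\, m^{(q-1)/q} d^{1/q}.$$

Moreover, note that $(\psi^{-1})'(x) = \frac{1}{q} x^{-1-1/q}$ and

$$\sum_{i=1}^d \mathbb{E}\, \frac{\tilde z_t(i)^2}{(\psi^{-1})'(x_t(i))} \le q \sum_{i=1}^d x_t(i)^{1/q} \le q\, m^{1/q} d^{1-1/q},$$

which concludes the proof.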



4 Bandit feedback. 

In this section we consider online combinatorial optimization with bandit feedback. This setting is
much more challenging than the semi-bandit case, and in order to obtain sublinear regret bounds all
known strategies add an exploration component to the algorithm. For example, in EXP2, instead
of playing an action at random according to the exponentially weighted average distribution $p_t$,
one draws a random action from $p_t$ with probability $1 - \gamma$ and from some fixed "exploration"
distribution $\mu$ with probability $\gamma$. On the other hand, in OSMD, one randomly perturbs $x_t$ to some
$\tilde x_t$, and then plays at random a point in $\mathcal{A}$ such that on average one plays $\tilde x_t$.

In Bubeck, Cesa-Bianchi, and Kakade [9], the authors study the EXP2 strategy with the exploration
distribution $\mu$ supported on the contact points between the polytope $\mathrm{Conv}(\mathcal{A})$ and the John
ellipsoid of this polytope (i.e., the ellipsoid of minimal volume enclosing the polytope). Using this
method they are able to prove the best known upper bound for online combinatorial optimization
with bandit feedback. They show that the regret of EXP2 mixed with John's exploration (and with
the estimate described in Figure 2) satisfies

$$R_n \le 2 m^{3/2} \sqrt{3 d n \log \frac{ed}{m}}.$$

Our next theorem shows that no strategy can achieve a regret less than a constant times $m\sqrt{dn}$,
leaving a gap of a factor of $\sqrt{m \log \frac{d}{m}}$. As we argue below, we conjecture that the lower bound is of
the correct order of magnitude. However, improving the upper bound seems to require some
substantially new ideas. Note that the following bound gives limitations that no strategy can surpass,
in contrast to Theorem 1, which was dedicated to the EXP2 strategy.

Theorem 5 Let $n \ge d \ge 2m$. There exists a subset $\mathcal{A} \subset \{0,1\}^d$ such that $\|a\|_1 = m$,
$\forall a \in \mathcal{A}$, and under bandit feedback, one has

$$\inf_{\text{strategies}} \ \sup_{\text{adversaries}} R_n \ge 0.02\, m \sqrt{dn}, \quad (8)$$

where the infimum and the supremum are taken over the class of strategies for the "player" and
for the "adversary" as defined in the introduction.

Note that it should not come as a surprise that EXP2 (with John's exploration) is suboptimal,
since even in the full information case the basic EXP2 strategy was provably suboptimal, see
Theorem 1. We conjecture that the correct order of magnitude for the minimax regret in the bandit case
is $m\sqrt{dn}$, as the above lower bound suggests.

A promising approach to resolve this conjecture is to consider again the OSMD approach.
However we believe that in the bandit case, one has to consider Legendre functions with
non-diagonal Hessian (in contrast to the Legendre functions considered so far in this paper).
Abernethy, Hazan, and Rakhlin [1] propose to use a self-concordant barrier function for the polytope
$\mathrm{Conv}(\mathcal{A})$. Then they randomly perturb the point $x_t$ given by OSMD using the eigenstructure of
the Hessian. This approach leads to a regret upper bound of order $md\sqrt{\theta n \log n}$ for $\theta > 0$ when
$\mathrm{Conv}(\mathcal{A})$ admits a $\theta$-self-concordant barrier function. Unfortunately, even when there exists an
$O(1)$-self-concordant barrier, this bound is still larger than the conjectured optimal bound by a
factor $\sqrt{d}$. In fact, it was proved in [9] that in some cases there exist better choices for the
Legendre function and the perturbation than those described in [1], even when there is an $O(1)$-self-concordant
function for the action set. How to generalize this approach to the polytopes involved
in online combinatorial optimization is a challenging open problem.



A Proof of Theorem 1. 

For the sake of simplicity, we assume that $d$ is a multiple of 4 and that $n$ is even. We consider the 
following subset of the hypercube: 

$$\mathcal{A} = \Big\{ a \in \{0,1\}^d : \sum_{i=1}^{d/2} a_i = d/4 \ \text{ and } \ \big(a_i = 1, \forall i \in \{d/2+1, \dots, d/2+d/4\}\big) \ \text{ or } \ \big(a_i = 1, \forall i \in \{d/2+d/4+1, \dots, d\}\big) \Big\}.$$

That is, choosing a point in $\mathcal{A}$ corresponds to choosing a subset of $d/4$ elements among the first 
half of the coordinates, and choosing one of the two disjoint intervals of size $d/4$ in the second 
half of the coordinates. 

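We prove that for any parameter $\eta$, there exists an adversary such that Exp2 (with parameter $\eta$) 
has a regret of at least $\frac{nd}{16}\tanh\left(\frac{\eta d}{8}\right)$, and that there exists another adversary such that its 
regret is at least $\min\left(\frac{d\log 2}{12\eta}, \frac{nd}{12}\right)$. As a consequence, we have 

$$\sup R_n \ge \max\left(\frac{nd}{16}\tanh\left(\frac{\eta d}{8}\right), \min\left(\frac{d\log 2}{12\eta}, \frac{nd}{12}\right)\right) \ge \min\left(\max\left(\frac{nd}{16}\tanh\left(\frac{\eta d}{8}\right), \frac{d\log 2}{12\eta}\right), \frac{nd}{12}\right) \ge \min\left(A, \frac{nd}{12}\right),$$

with 

$$A = \min_{\eta \in [0,+\infty)} \max\left(\frac{nd}{16}\tanh\left(\frac{\eta d}{8}\right), \frac{d\log 2}{12\eta}\right)$$

$$\ge \min\left(\min_{\eta d \ge 8} \frac{nd}{16}\tanh\left(\frac{\eta d}{8}\right), \ \min_{\eta d < 8} \max\left(\frac{nd}{16}\tanh\left(\frac{\eta d}{8}\right), \frac{d\log 2}{12\eta}\right)\right)$$

$$\ge \min\left(\frac{nd}{16}\tanh(1), \ \min_{\eta d < 8} \max\left(\frac{nd}{16}\,\frac{\eta d}{8}\tanh(1), \frac{d\log 2}{12\eta}\right)\right)$$

$$\ge \min\left(\frac{nd}{16}\tanh(1), \ \sqrt{\frac{n d^3 \log 2 \times \tanh(1)}{128 \times 12}}\right) \ge \min\left(0.04\, nd, \ 0.01\, d^{3/2}\sqrt{n}\right),$$

where we used the fact that tanh is concave and increasing on $\mathbb{R}_+$. As $n \ge d$, this implies the 
stated lower bound. 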
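
The chain of inequalities above can be sanity-checked numerically. The sketch below is only an illustration, not part of the proof: the function names `minimax_A` and `closed_form_bound` and the bisection bracket are assumptions introduced here. It locates the crossing point of the increasing term and the decreasing term (where the maximum of the two is minimized) and compares the resulting value with $\min(0.04\,nd,\ 0.01\,d^{3/2}\sqrt{n})$.

```python
import math

def minimax_A(n, d, iters=200):
    """Approximate A = min_eta max(nd/16 * tanh(eta*d/8), d*log(2)/(12*eta))."""
    increasing = lambda eta: n * d / 16 * math.tanh(eta * d / 8)
    decreasing = lambda eta: d * math.log(2) / (12 * eta)
    # Assumed bracket: the decreasing term dominates at lo, the increasing one at hi.
    lo, hi = 1e-12, 1e12
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisection on a logarithmic scale
        if increasing(mid) < decreasing(mid):
            lo = mid
        else:
            hi = mid
    # The max of an increasing and a decreasing function is minimized where they cross.
    return max(increasing(hi), decreasing(hi))

def closed_form_bound(n, d):
    """Right-hand side of the last inequality in the display above."""
    return min(0.04 * n * d, 0.01 * d ** 1.5 * math.sqrt(n))

for n, d in [(100, 8), (1000, 8), (10**4, 64), (10**6, 400)]:
    assert minimax_A(n, d) >= closed_form_bound(n, d)
```

Since $nd/12 \ge 0.04\,nd$, the same comparison also covers the outer $\min(A, nd/12)$.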

First we prove the lower bound $\frac{nd}{16}\tanh\left(\frac{\eta d}{8}\right)$. Define the following adversary: 

$$z_t(i) = \begin{cases} 1 & \text{if } i \in \{d/2+1, \dots, d/2+d/4\} \text{ and } t \text{ odd}, \\ 1 & \text{if } i \in \{d/2+d/4+1, \dots, d\} \text{ and } t \text{ even}, \\ 0 & \text{otherwise.} \end{cases}$$

This adversary always puts a zero loss on the first half of the coordinates, and alternates between 
a loss of $d/4$ for choosing the first interval (in the second half of the coordinates) and the second 
interval. At the beginning of odd rounds, any vertex $a \in \mathcal{A}$ has the same cumulative loss and 
thus Exp2 picks its expert uniformly at random, which yields an expected cumulative loss equal to 
$nd/16$. On the other hand, at even rounds the probability distribution to select the vertex $a \in \mathcal{A}$ is 
always the same. More precisely, the probability of selecting a vertex which contains the interval 
$\{d/2+d/4+1, \dots, d\}$ (i.e., the interval with a $d/4$ loss at this round) is exactly $\frac{1}{1+\exp(-\eta d/4)}$. This 
adds an expected cumulative loss equal to $\frac{nd}{8}\,\frac{1}{1+\exp(-\eta d/4)}$. Finally, note that the loss of any fixed 
vertex is $nd/8$. Thus, we obtain 

$$R_n = \frac{nd}{16} + \frac{nd}{8}\,\frac{1}{1+\exp(-\eta d/4)} - \frac{nd}{8} = \frac{nd}{16}\tanh\left(\frac{\eta d}{8}\right).$$
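
The closed form above can be reproduced by running exponential weights over the vertices of the action set for a small dimension. This is a minimal sketch under stated assumptions: the helper names `action_set` and `exp2_expected_regret` are introduced here for illustration, expectations are computed exactly rather than by sampling, and no exploration is mixed in.

```python
import itertools
import math

def action_set(d):
    """Vertices with d/4 ones in the first half plus one full interval of the second half."""
    half, quarter = d // 2, d // 4
    actions = []
    for subset in itertools.combinations(range(half), quarter):
        for start in (half, half + quarter):
            a = [0] * d
            for i in subset:
                a[i] = 1
            for i in range(start, start + quarter):
                a[i] = 1
            actions.append(tuple(a))
    return actions

def exp2_expected_regret(n, d, eta):
    """Expected regret of exponential weights against the alternating adversary."""
    actions = action_set(d)
    half, quarter = d // 2, d // 4
    cumulative = [0.0] * len(actions)
    expected_loss = 0.0
    for t in range(1, n + 1):
        # Odd rounds charge the first interval, even rounds charge the second one.
        block = range(half, half + quarter) if t % 2 == 1 else range(half + quarter, d)
        z = [0] * d
        for i in block:
            z[i] = 1
        shift = min(cumulative)  # subtract the minimum for numerical stability
        weights = [math.exp(-eta * (c - shift)) for c in cumulative]
        total = sum(weights)
        losses = [sum(ai * zi for ai, zi in zip(a, z)) for a in actions]
        expected_loss += sum(w * l for w, l in zip(weights, losses)) / total
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return expected_loss - min(cumulative)

n, d, eta = 10, 8, 0.3
assert abs(exp2_expected_regret(n, d, eta) - n * d / 16 * math.tanh(eta * d / 8)) < 1e-9
```

The check requires $d$ to be a multiple of 4 and $n$ to be even, as assumed at the start of the proof.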

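It remains to show a lower bound proportional to $1/\eta$. To this end, we consider a different 
adversary defined by 

$$z_t(i) = \begin{cases} 1-\epsilon & \text{if } i \le d/4, \\ 1 & \text{if } i \in \{d/4+1, \dots, d/2\}, \\ 0 & \text{otherwise,} \end{cases}$$

for some fixed $\epsilon > 0$. 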

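Note that against this adversary the choice of the interval (in the second half of the components) 
does not matter. Moreover, by symmetry, the weight of any coordinate in $\{d/4+1, \dots, d/2\}$ is 
the same (at any round). Finally, note that this weight is decreasing with $t$. Thus, we have the 
following lower bound and identities (in the big sums $i$ represents the number of components 
selected in the first $d/4$ components): 

$$R_n \ge \frac{n\epsilon d}{4}\, \frac{\sum_{a \in \mathcal{A} : a_{d/2} = 1} \exp(-\eta n z_1^T a)}{\sum_{a \in \mathcal{A}} \exp(-\eta n z_1^T a)}$$

$$= \frac{n\epsilon d}{4}\, \frac{\sum_{i=0}^{d/4-1} \binom{d/4}{i}\binom{d/4-1}{d/4-i-1} \exp\left(-\eta (nd/4 - i n \epsilon)\right)}{\sum_{i=0}^{d/4} \binom{d/4}{i}\binom{d/4}{d/4-i} \exp\left(-\eta (nd/4 - i n \epsilon)\right)}$$

$$= \frac{n\epsilon d}{4}\, \frac{\sum_{i=0}^{d/4-1} \binom{d/4}{i}\binom{d/4-1}{d/4-i-1} \exp(\eta i n \epsilon)}{\sum_{i=0}^{d/4} \binom{d/4}{i}\binom{d/4}{d/4-i} \exp(\eta i n \epsilon)}$$

$$= \frac{n\epsilon d}{4}\, \frac{\sum_{i=0}^{d/4-1} \left(1 - \frac{4i}{d}\right) \binom{d/4}{i}\binom{d/4}{d/4-i} \exp(\eta i n \epsilon)}{\sum_{i=0}^{d/4} \binom{d/4}{i}\binom{d/4}{d/4-i} \exp(\eta i n \epsilon)},$$

where we used $\binom{d/4-1}{d/4-i-1} = \left(1 - \frac{4i}{d}\right)\binom{d/4}{d/4-i}$ in the last equality. Thus, taking 
$\epsilon = \min\left(\frac{\log 2}{\eta n}, 1\right)$ yields 

$$R_n \ge \frac{n \epsilon d}{12} = \min\left(\frac{d \log 2}{12 \eta}, \frac{nd}{12}\right),$$

where the inequality follows from Lemma 3 in the appendix (applied with $k = d/4$ and 
$c = \exp(\eta n \epsilon) \in [1, 2]$). This concludes the proof of the lower bound. 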



B Proof of Theorem 5 

The structure of the proof is similar to that of [?, Theorem 30], which deals with the simple case 
where $m = 1$. The main conceptual difference is contained in Lemma 4, which is at the 
heart of this new proof. The main argument follows the line of standard lower bounds for bandit 
problems, see, e.g., [10]: The worst-case regret is bounded from below by taking an average 
over a conveniently chosen class of strategies of the adversary. Then, by Pinsker's inequality, the 
problem is reduced to computing the Kullback-Leibler divergence of certain distributions. The 
main technical argument, given in Lemma 4, provides manageable bounds for the relevant 
Kullback-Leibler divergence. 

For the sake of simplifying notation, we assume that $d$ is a multiple of $m$, and we identify 
$\{0,1\}^d$ with the set of $m \times (d/m)$ binary matrices $\{0,1\}^{m \times \frac{d}{m}}$. We consider the following set of 
actions: 

$$\mathcal{A} = \Big\{ a \in \{0,1\}^{m \times \frac{d}{m}} : \forall i \in \{1, \dots, m\}, \ \sum_{j=1}^{d/m} a(i,j) = 1 \Big\}.$$

In other words, the player is playing in parallel $m$ finite games with $d/m$ actions. 
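
To make the structure of this action set concrete, the following sketch enumerates it for small values. The function name `parallel_games_actions` is an assumption introduced for illustration. Each action is an $m \times (d/m)$ binary matrix with exactly one 1 per row, so there are $(d/m)^m$ actions and each has $\ell_1$ norm $m$.

```python
import itertools

def parallel_games_actions(d, m):
    """Enumerate all m x (d/m) binary matrices with exactly one 1 in each row."""
    cols = d // m
    actions = []
    # One column index per row: row i plays column choice[i] in the i-th game.
    for choice in itertools.product(range(cols), repeat=m):
        matrix = tuple(
            tuple(1 if j == choice[i] else 0 for j in range(cols))
            for i in range(m)
        )
        actions.append(matrix)
    return actions

actions = parallel_games_actions(d=6, m=2)
assert len(actions) == (6 // 2) ** 2
assert all(sum(sum(row) for row in a) == 2 for a in actions)
```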

From step 1 to 3 we restrict our attention to the case of deterministic strategies for the player, 
and we show how to extend the results to arbitrary strategies in step 4. 

First step: definitions. 

We denote by $I_{i,t} \in \{1, \dots, d/m\}$ the random variable such that $a_t(i, I_{i,t}) = 1$. That is, $I_{i,t}$ is 
the action chosen at time $t$ in the $i$-th game. Moreover, let $\tau$ be drawn uniformly at random from 
$\{1, \dots, n\}$. 

In this proof we consider random adversaries indexed by $\mathcal{A}$. More precisely, for $\alpha \in \mathcal{A}$, we 
define the $\alpha$-adversary as follows: For any $t \in \{1, \dots, n\}$, $z_t(i,j)$ is drawn from a Bernoulli 
distribution with parameter $\frac{1}{2} - \epsilon\alpha(i,j)$. In other words, against adversary $\alpha$, in the $i$-th game, the 
action $j$ such that $\alpha(i,j) = 1$ has a loss slightly smaller (in expectation) than the other actions. We 
denote by $E_\alpha$ integration with respect to the loss generation process of the $\alpha$-adversary. We write 
$P_{i,\alpha}$ for the probability distribution of $\alpha(i, I_{i,\tau})$ when the player faces the $\alpha$-adversary. Note that 
we have $P_{i,\alpha}(1) = E_\alpha \frac{1}{n}\sum_{t=1}^n 1_{\alpha(i, I_{i,t}) = 1}$, hence, against the $\alpha$-adversary, we have 

$$E_\alpha \sum_{t=1}^n \left(a_t^T z_t - \alpha^T z_t\right) = n\epsilon \sum_{i=1}^m \left(1 - P_{i,\alpha}(1)\right),$$

which implies (since the maximum is larger than the mean) 

$$\max_{\alpha \in \mathcal{A}} E_\alpha \sum_{t=1}^n \left(a_t^T z_t - \alpha^T z_t\right) \ge n\epsilon \sum_{i=1}^m \left(1 - \frac{1}{(d/m)^m} \sum_{\alpha \in \mathcal{A}} P_{i,\alpha}(1)\right). \qquad (9)$$

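Second step: information inequality. 

Let $P_{-i,\alpha}$ be the probability distribution of $\alpha(i, I_{i,\tau})$ against the adversary which plays like the 
$\alpha$-adversary except that in the $i$-th game, the losses of all coordinates are drawn from a Bernoulli 
distribution of parameter 1/2. We call it the $(-i, \alpha)$-adversary and we denote by $E_{(-i,\alpha)}$ integration 
with respect to its loss generation process. By Pinsker's inequality, 

$$P_{i,\alpha}(1) \le P_{-i,\alpha}(1) + \sqrt{\frac{1}{2}\mathrm{KL}(P_{-i,\alpha}, P_{i,\alpha})},$$

where KL denotes the Kullback-Leibler divergence. Moreover, note that by symmetry of the 
adversaries $(-i, \alpha)$, 

$$\frac{1}{(d/m)^m} \sum_{\alpha \in \mathcal{A}} P_{-i,\alpha}(1) = \frac{1}{(d/m)^m} \sum_{\alpha \in \mathcal{A}} E_{(-i,\alpha)}\, \alpha(i, I_{i,\tau})$$

$$= \frac{1}{(d/m)^m} \sum_{\beta \in \mathcal{A}} \frac{1}{d/m} \sum_{\alpha : (-i,\alpha) = (-i,\beta)} E_{(-i,\alpha)}\, \alpha(i, I_{i,\tau})$$

$$= \frac{1}{(d/m)^m} \sum_{\beta \in \mathcal{A}} \frac{1}{d/m}\, E_{(-i,\beta)} \sum_{\alpha : (-i,\alpha) = (-i,\beta)} \alpha(i, I_{i,\tau})$$

$$= \frac{1}{(d/m)^m} \sum_{\beta \in \mathcal{A}} \frac{1}{d/m} = \frac{m}{d}, \qquad (10)$$

and thus, thanks to the concavity of the square root, 

$$\frac{1}{(d/m)^m} \sum_{\alpha \in \mathcal{A}} P_{i,\alpha}(1) \le \frac{m}{d} + \sqrt{\frac{1}{2 (d/m)^m} \sum_{\alpha \in \mathcal{A}} \mathrm{KL}(P_{-i,\alpha}, P_{i,\alpha})}. \qquad (11)$$

Third step: computation of $\mathrm{KL}(P_{-i,\alpha}, P_{i,\alpha})$ with the chain rule. 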

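Note that since the forecaster is deterministic, the sequence of observed losses (up to time 
$n$) $W_n \in \{0, \dots, m\}^n$ uniquely determines the empirical distribution of plays, and, in particular, 
the probability distribution of $\alpha(i, I_{i,\tau})$ conditionally to $W_n$ is the same for any adversary. Thus, 
if we denote by $P^n_\alpha$ (respectively $P^n_{-i,\alpha}$) the probability distribution of $W_n$ when the forecaster 
plays against the $\alpha$-adversary (respectively the $(-i, \alpha)$-adversary), then one can easily prove that 
$\mathrm{KL}(P_{-i,\alpha}, P_{i,\alpha}) \le \mathrm{KL}(P^n_{-i,\alpha}, P^n_\alpha)$. Now we use the chain rule for Kullback-Leibler divergence 
iteratively to introduce the probability distributions $P^t_\alpha$ of the observed losses $W_t$ up to time $t$. 
More precisely, we have 

$$\mathrm{KL}(P^n_{-i,\alpha}, P^n_\alpha) = \mathrm{KL}(P^1_{-i,\alpha}, P^1_\alpha) + \sum_{t=2}^n \sum_{w_{t-1} \in \{0,\dots,m\}^{t-1}} P^{t-1}_{-i,\alpha}(w_{t-1})\, \mathrm{KL}\left(P^t_{-i,\alpha}(\cdot \mid w_{t-1}), P^t_\alpha(\cdot \mid w_{t-1})\right)$$

$$= \mathrm{KL}(B_\emptyset, B'_\emptyset)\, 1_{\alpha(i, I_{i,1}) = 1} + \sum_{t=2}^n \sum_{w_{t-1} : \alpha(i, I_{i,t}) = 1} P^{t-1}_{-i,\alpha}(w_{t-1})\, \mathrm{KL}\left(B_{w_{t-1}}, B'_{w_{t-1}}\right),$$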

where $B_{w_{t-1}}$ and $B'_{w_{t-1}}$ are sums of $m$ Bernoulli distributions with parameters in $\{1/2, 1/2 - \epsilon\}$ 
and such that the number of Bernoullis with parameter 1/2 in $B_{w_{t-1}}$ is equal to the number of 
Bernoullis with parameter 1/2 in $B'_{w_{t-1}}$ plus one. Now using Lemma 4 (see below) we obtain. 

In particular, this gives 

Summing and plugging this into (11) we obtain (again thanks to (10)), for $\epsilon \le 1/\sqrt{8}$, 

$$\frac{1}{(d/m)^m} \sum_{\alpha \in \mathcal{A}} P_{i,\alpha}(1) \le \frac{m}{d} + \epsilon \sqrt{\frac{8n}{d}}.$$

To conclude the proof of (8) for deterministic players one needs to plug this last equation in (9) 
along with straightforward computations. 
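
The intermediate bounds of this step can be filled in along the following lines. This is a sketch under explicit assumptions, not a transcription of the original displays: Lemma 4 is taken to be applied with $n + 1 = m$ summands, $p = 1/2$, $p' = 1/2 - \epsilon$ and $q \in \{1/2, 1/2 - \epsilon\}$ (so that $q \ge 1/2 - \epsilon$), and the constants below are the ones this choice produces.

```latex
\begin{align*}
\mathrm{KL}\left(B_{w_{t-1}}, B'_{w_{t-1}}\right)
  &\le \frac{2\epsilon^2}{(1/2+\epsilon)\,(m+1)\,(1/2-\epsilon)}
   = \frac{8\epsilon^2}{(1-4\epsilon^2)(m+1)}, \\
\mathrm{KL}\left(P^n_{-i,\alpha}, P^n_{\alpha}\right)
  &\le \frac{8\epsilon^2}{(1-4\epsilon^2)(m+1)}\,
       E_{(-i,\alpha)} \sum_{t=1}^n 1_{\alpha(i, I_{i,t}) = 1}
   = \frac{8\epsilon^2 n}{(1-4\epsilon^2)(m+1)}\, P_{-i,\alpha}(1), \\
\frac{1}{(d/m)^m} \sum_{\alpha \in \mathcal{A}} \mathrm{KL}\left(P_{-i,\alpha}, P_{i,\alpha}\right)
  &\le \frac{8\epsilon^2 n}{(1-4\epsilon^2)(m+1)} \cdot \frac{m}{d}
   \le \frac{16\,\epsilon^2 n}{d}
   \qquad \text{for } \epsilon \le 1/\sqrt{8}.
\end{align*}
```

The last line uses the symmetry identity $\frac{1}{(d/m)^m}\sum_{\alpha} P_{-i,\alpha}(1) = m/d$ together with $1 - 4\epsilon^2 \ge 1/2$ and $m/(m+1) \le 1$. Taking half of this quantity under the square root gives the term $\epsilon\sqrt{8n/d}$.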

Fourth step: Fubini's theorem to handle non-deterministic players. 

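Consider now a randomized player, and let $E_{\mathrm{rand}}$ denote the expectation with respect to the 
randomization of the player. Then one has (thanks to Fubini's theorem), 

$$\frac{1}{(d/m)^m} \sum_{\alpha \in \mathcal{A}} E \sum_{t=1}^n \left(a_t^T z_t - \alpha^T z_t\right) = E_{\mathrm{rand}}\, \frac{1}{(d/m)^m} \sum_{\alpha \in \mathcal{A}} E_\alpha \sum_{t=1}^n \left(a_t^T z_t - \alpha^T z_t\right).$$

Now note that if we fix the realization of the forecaster's randomization then the results of the 
previous steps apply and, in particular, one can lower bound $\frac{1}{(d/m)^m} \sum_{\alpha \in \mathcal{A}} E_\alpha \sum_{t=1}^n \left(a_t^T z_t - \alpha^T z_t\right)$ 
as before (note that $\alpha$ is the optimal action in expectation against the $\alpha$-adversary). 

C Technical lemmas. 

Lemma 3 For any $k \in \mathbb{N}^*$, for any $1 \le c \le 2$, we have 

$$\frac{\sum_{i=0}^k (1 - i/k) \binom{k}{i}^2 c^i}{\sum_{i=0}^k \binom{k}{i}^2 c^i} \ge 1/3.$$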



Proof Let $f(c)$ denote the expression on the left-hand side of the inequality. Introduce the random 
variable $X$, which is equal to $i \in \{0, \dots, k\}$ with probability $\binom{k}{i}^2 c^i \big/ \sum_{j=0}^k \binom{k}{j}^2 c^j$. We have 
$f'(c) = \frac{1}{c} E[X(1 - X/k)] - \frac{1}{c} E(X)\, E(1 - X/k) = -\frac{1}{ck} \mathrm{Var}\, X \le 0$. So the function $f$ is decreasing 
on $[1,2]$, and therefore it suffices to consider $c = 2$. Numerator and denominator of the left-hand 
side differ only by the factor $1 - i/k$. A lower bound for the left-hand side can thus be obtained by 
showing that the terms for $i$ close to $k$ are not essential to the value of the denominator. To prove 
this, we may use Stirling's formula which implies that for any $k \ge 2$ and $i \in [1, k-1]$, 

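$$\left(\frac{k}{i}\right)^i \left(\frac{k}{k-i}\right)^{k-i} \frac{\sqrt{k}\, e^{-1/6}}{\sqrt{2\pi i (k-i)}} < \binom{k}{i} < \left(\frac{k}{i}\right)^i \left(\frac{k}{k-i}\right)^{k-i} \frac{\sqrt{k}\, e^{1/12}}{\sqrt{2\pi i (k-i)}},$$

hence 

$$\left(\frac{k}{k-i}\right)^{2(k-i)} \left(\frac{k}{i}\right)^{2i} \frac{k\, e^{-1/3}}{2\pi i (k-i)} < \binom{k}{i}^2 < \left(\frac{k}{i}\right)^{2i} \left(\frac{k}{k-i}\right)^{2(k-i)} \frac{e^{1/6}}{2\pi i}.$$

Introduce $\lambda = i/k$ and $\chi(\lambda) = \frac{2^\lambda}{\lambda^{2\lambda} (1-\lambda)^{2(1-\lambda)}}$. We have 

$$[\chi(\lambda)]^k\, \frac{2 e^{-1/3}}{\pi k} < \binom{k}{i}^2 2^i < [\chi(\lambda)]^k\, \frac{e^{1/6}}{2\pi\lambda}. \qquad (12)$$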

Lemma 3 can be numerically verified for $k \le 10^6$. We now consider $k > 10^6$. For $\lambda \ge 
0.666$, since the function $\chi$ can be shown to be decreasing on $[0.666, 1]$, the inequality 
$\binom{k}{i}^2 2^i \le [\chi(0.666)]^k \frac{e^{1/6}}{2\pi \times 0.666}$ holds. We have $\chi(0.657)/\chi(0.666) > 1.0002$. Consequently, for $k > 10^6$ 
we have $[\chi(0.666)]^k < 0.001 \times [\chi(0.657)]^k / k^2$. So for $\lambda \ge 0.666$ and $k > 10^6$ we have 

$$\binom{k}{i}^2 2^i < 0.001 \times [\chi(0.657)]^k\, \frac{e^{1/6}}{2\pi \times 0.666 \times k^2} < [\chi(0.657)]^k\, \frac{2 e^{-1/3}}{1000 \pi k^2}$$

$$= \min_{\lambda \in [0.656, 0.657]} [\chi(\lambda)]^k\, \frac{2 e^{-1/3}}{1000 \pi k^2} < \frac{1}{1000 k} \max_{i \in \{1, \dots, k-1\} \cap [0, 0.666k)} \binom{k}{i}^2 2^i, \qquad (13)$$

where the last inequality comes from (12) and the fact that there exists $i \in \{1, \dots, k-1\}$ such that 
$i/k \in [0.656, 0.657]$. Inequality (13) implies that 

$$\sum_{0.666k \le j \le k} \binom{k}{j}^2 2^j < \frac{1}{1000} \max_{i \in \{1, \dots, k-1\} \cap [0, 0.666k)} \binom{k}{i}^2 2^i < \frac{1}{1000} \sum_{0 \le j < 0.666k} \binom{k}{j}^2 2^j.$$

To conclude, introducing $A = \sum_{0 \le i < 0.666k} \binom{k}{i}^2 2^i$, we have 

$$\frac{\sum_{i=0}^k (1 - i/k) \binom{k}{i}^2 2^i}{\sum_{i=0}^k \binom{k}{i}^2 2^i} \ge \frac{(1 - 0.666) A}{A + 0.001 A} \ge \frac{1}{3}.$$
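
The numerical verification mentioned in the proof can be sketched as follows for small $k$. This is an illustration only: the function name `lemma3_ratio` and the range of $k$ tested are assumptions made here, and exact rational arithmetic is used so that the boundary case is compared without rounding error.

```python
from fractions import Fraction
from math import comb

def lemma3_ratio(k, c):
    """Left-hand side of Lemma 3, computed exactly with rational arithmetic."""
    c = Fraction(c)
    numerator = sum((1 - Fraction(i, k)) * comb(k, i) ** 2 * c ** i for i in range(k + 1))
    denominator = sum(comb(k, i) ** 2 * c ** i for i in range(k + 1))
    return numerator / denominator

# The bound is attained with equality at k = 1, c = 2.
assert lemma3_ratio(1, 2) == Fraction(1, 3)

# Check a small range of k and a few values of c in [1, 2].
for k in range(1, 41):
    for c in (1, Fraction(3, 2), 2):
        assert lemma3_ratio(k, c) >= Fraction(1, 3)
```

The case $k = 1$, $c = 2$ shows that the constant 1/3 cannot be improved.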



Lemma 4 Let $\ell$ and $n$ be integers with $\frac{1}{2} \le \frac{n}{2} \le \ell \le n$. Let $p, p', q, p_1, \dots, p_n$ be real numbers in 
$(0, 1)$ with $q \in \{p, p'\}$, $p_1 = \dots = p_\ell = q$ and $p_{\ell+1} = \dots = p_n$. Let $B$ (resp. $B'$) be the sum of 
$n + 1$ independent Bernoulli distributions with parameters $p, p_1, \dots, p_n$ (resp. $p', p_1, \dots, p_n$). We 
have 

$$\mathrm{KL}(B, B') \le \frac{2(p' - p)^2}{(1 - p')(n + 2) q}.$$
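
As an illustration of the statement, the sketch below computes both sides of the inequality for a few parameter choices. The helper names (`bernoulli_sum_pmf`, `kl`, `lemma4_sides`) and the argument `rest` (the common value of $p_{\ell+1}, \dots, p_n$) are assumptions introduced here; the distribution of a sum of independent Bernoulli variables is obtained by repeated convolution.

```python
import math

def bernoulli_sum_pmf(params):
    """Probability mass function of a sum of independent Bernoulli variables."""
    pmf = [1.0]
    for p in params:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)   # this Bernoulli equals 0
            nxt[k + 1] += mass * p     # this Bernoulli equals 1
        pmf = nxt
    return pmf

def kl(P, Q):
    """Kullback-Leibler divergence between two pmfs on the same finite support."""
    return sum(a * math.log(a / b) for a, b in zip(P, Q) if a > 0)

def lemma4_sides(p, p_prime, q, ell, n, rest):
    """Return (KL(B, B'), right-hand side of Lemma 4)."""
    others = [q] * ell + [rest] * (n - ell)
    B = bernoulli_sum_pmf([p] + others)
    B_prime = bernoulli_sum_pmf([p_prime] + others)
    bound = 2 * (p_prime - p) ** 2 / ((1 - p_prime) * (n + 2) * q)
    return kl(B, B_prime), bound

for args in [(0.5, 0.4, 0.4, 1, 1, 0.5), (0.4, 0.5, 0.4, 1, 1, 0.5), (0.5, 0.45, 0.45, 4, 6, 0.5)]:
    lhs, rhs = lemma4_sides(*args)
    assert 0 <= lhs <= rhs
```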



Proof Let $Z, Z', Z_1, \dots, Z_n$ be independent Bernoulli distributions with parameters $p, p', p_1, \dots, p_n$. 
Define $S = \sum_{i=1}^{\ell} Z_i$, $T = \sum_{i=\ell+1}^{n} Z_i$ and $V = Z + S$. By a slight and usual abuse of notation, we 
use KL to denote the Kullback-Leibler divergence of both probability distributions and random 
variables. Then we may write (the inequality is an easy consequence of the chain rule for 
Kullback-Leibler divergence) 

$$\mathrm{KL}(B, B') = \mathrm{KL}\big((Z + S) + T, (Z' + S) + T\big) \le \mathrm{KL}\big((Z + S, T), (Z' + S, T)\big) = \mathrm{KL}(Z + S, Z' + S).$$

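Let $s_k = P(S = k)$ for $k = -1, 0, \dots, \ell + 1$. Using the equalities 

$$s_k = \binom{\ell}{k} q^k (1-q)^{\ell - k} = \frac{q}{1-q}\, \frac{\ell - k + 1}{k} \binom{\ell}{k-1} q^{k-1} (1-q)^{\ell - k + 1} = \frac{q}{1-q}\, \frac{\ell - k + 1}{k}\, s_{k-1},$$

which hold for $1 \le k \le \ell + 1$, we obtain 

$$\mathrm{KL}(Z + S, Z' + S) = \sum_{k=0}^{\ell+1} P(V = k) \log\left(\frac{p\, s_{k-1} + (1-p)\, s_k}{p'\, s_{k-1} + (1-p')\, s_k}\right)$$

$$= \sum_{k=0}^{\ell+1} P(V = k) \log\left(\frac{p\, \frac{1-q}{q}\, k + (1-p)(\ell - k + 1)}{p'\, \frac{1-q}{q}\, k + (1-p')(\ell - k + 1)}\right)$$

$$= E \log\left(\frac{(p - q) V + (1-p)\, q\, (\ell+1)}{(p' - q) V + (1-p')\, q\, (\ell+1)}\right). \qquad (14)$$

First case: $q = p'$. 

By Jensen's inequality, using that $EV = p'(\ell + 1) + p - p'$ in this case, we get 

$$\mathrm{KL}(Z + S, Z' + S) \le \log\left(\frac{(p - p')\, EV + (1-p)\, p'\, (\ell+1)}{(1-p')\, p'\, (\ell+1)}\right)$$

$$= \log\left(\frac{(p - p')^2 + (1-p')\, p'\, (\ell+1)}{(1-p')\, p'\, (\ell+1)}\right)$$

$$= \log\left(1 + \frac{(p - p')^2}{(1-p')\, p'\, (\ell+1)}\right) \le \frac{(p - p')^2}{(1-p')\, p'\, (\ell+1)}.$$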

Second case: $q = p$. 

In this case, $V$ is a binomial distribution with parameters $\ell + 1$ and $p$. From (14), we have 

$$\mathrm{KL}(Z + S, Z' + S) \le -E \log\left(\frac{(p' - p) V + (1-p')\, p\, (\ell+1)}{(1-p)\, p\, (\ell+1)}\right) = -E \log\left(1 + \frac{(p' - p)(V - EV)}{(1-p)\, p\, (\ell+1)}\right). \qquad (15)$$

To conclude, we will use the following lemma. 

Lemma 5 The following inequality holds for any $x \ge x_0$ with $x_0 \in (0, 1)$: 

$$-\log(x) \le -(x - 1) + \frac{(x-1)^2}{2 x_0}.$$

Proof Introduce $f(x) = -(x - 1) + \frac{(x-1)^2}{2x_0} + \log(x)$. We have $f'(x) = -1 + \frac{x-1}{x_0} + \frac{1}{x}$ and 
$f''(x) = \frac{1}{x_0} - \frac{1}{x^2}$. From $f'(x_0) = 0$, we get that $f'$ is negative on $(x_0, 1)$ and positive on $(1, +\infty)$. 
This leads to $f$ nonnegative on $[x_0, +\infty)$. ■ 
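
Lemma 5 is easy to probe numerically. The sketch below is only a sanity check: the function name `lemma5_gap`, the grid, and the tolerance are assumptions chosen here. It evaluates the function $f$ from the proof and confirms that it stays nonnegative for $x \ge x_0$.

```python
import math

def lemma5_gap(x, x0):
    """f(x) from the proof of Lemma 5: right-hand side minus left-hand side."""
    return -(x - 1) + (x - 1) ** 2 / (2 * x0) + math.log(x)

# Evaluate on a grid of x >= x0 for a few values of x0 in (0, 1).
for x0 in (0.1, 0.5, 0.9):
    for j in range(500):
        x = x0 + 0.01 * j
        # Small tolerance: f vanishes at x = 1, so rounding may give a tiny negative value.
        assert lemma5_gap(x, x0) >= -1e-12
```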



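Finally, from Lemma 5 and (15), using $x_0 = \frac{1-p'}{1-p}$, we obtain 

$$\mathrm{KL}(Z + S, Z' + S) \le \left(\frac{p' - p}{(1-p)\, p\, (\ell+1)}\right)^2 \frac{E[(V - EV)^2]}{2 x_0}$$

$$= \left(\frac{p' - p}{(1-p)\, p\, (\ell+1)}\right)^2 \frac{(\ell+1)\, p\, (1-p)^2}{2 (1-p')}$$

$$= \frac{(p' - p)^2}{2 (1-p')(\ell+1)\, p}.$$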